Findings

Overview

This tab contains several visualizations detailing the results of our project, along with a write-up discussing potential implications. It also offers thoughts on how to expand this project in the future, and ends with a short conclusion.

Our Results

After our data had been through our pipeline, we were able to confidently label 308,654 posts (including both original submissions and comments) across Reddit as partisan, leaning either Democrat or Republican. An additional 9,742,082 posts contained our keywords and were run through sentiment analysis, but could not be categorized to our standards.
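As a rough illustration of this labelling step, the sketch below combines keyword matching with a sentiment score to decide whether a post can be confidently assigned to a party. The keyword sets, the `0.5` threshold, and the function name are all hypothetical stand-ins for illustration, not our actual lists or cutoffs:

```python
# Hypothetical keyword sets; the real project used a much larger,
# lemmatized keyword list.
DEM_KEYWORDS = {"biden", "democrat", "democrats"}
REP_KEYWORDS = {"trump", "republican", "republicans"}

def label_post(text: str, sentiment: float, threshold: float = 0.5) -> str:
    """Assign a partisan label only when sentiment is strong enough
    and exactly one party's keywords appear; otherwise 'No Party'."""
    words = set(text.lower().split())
    has_dem = bool(words & DEM_KEYWORDS)
    has_rep = bool(words & REP_KEYWORDS)
    if has_dem == has_rep:           # both parties or neither: ambiguous
        return "No Party"
    if abs(sentiment) < threshold:   # sentiment too mild to call
        return "No Party"
    # Strong positive sentiment favors the mentioned party;
    # strong negative sentiment counts for the opposing party.
    if has_dem:
        return "Democrat" if sentiment > 0 else "Republican"
    return "Republican" if sentiment > 0 else "Democrat"
```

Anything failing either check lands in the large "could not be categorized" bucket, which is why the confidently labelled set is so much smaller than the keyword-matched set.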

Number of Posts by Month

We can see the month-by-month breakdown of these strongly partisan posts below:

From this we can glean that, overall, the number of strongly partisan Reddit posts identifiable by our process does not differ drastically between parties month to month; in a given month, we tend to see comparable numbers of partisan posts from both parties, even as the total volume of identifiably partisan posts rises and falls. This suggests that, once the highly partisan subreddits mentioned at the beginning of this process are intentionally excluded, overall partisan content on Reddit is not strongly skewed in one direction or the other. Political posts also seem to rise and fall in number together, regardless of party. There are a few interesting things to highlight from this graph, however:

  • Both parties peaked in aggregate number of posts in June, suggesting either a high rate of partisan posting or stronger positive/negative sentiment that made posts identifiable to us. Either there were more posts on political subjects, or tempers were high.

  • November saw a very high rate of positive posts concerning Republicans (or, conversely, negative posts concerning Democrats), especially when compared to the decline in Democratic-leaning posts since September, when Republican posting was at a significant low after the height of the summer months.

  • Both March and September saw somewhat aberrant behavior, at least compared to previous months: a spike in pro-Democrat / anti-Republican posts unaccompanied by a similar spike in pro-Republican / anti-Democrat posts.

The Reddit Thermometer

Now that we have taken a look at the partisanship of postings by monthly totals, let’s build our thermometer with our approval ratings and compare it to the polls!

Here are our results after aggregating the Reddit data as described in the “Reddit Results Extraction” section:
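A minimal sketch of this aggregation, assuming the extraction yields (month, label) pairs, would compute each party's share of the confidently partisan posts per month; the field names and the share formula here are our assumptions for illustration, not the exact extraction logic:

```python
from collections import Counter

def monthly_approval(labeled_posts):
    """labeled_posts: iterable of (month, party_label) pairs.
    Returns {month: (dem_share, rep_share)} computed over
    confidently partisan posts only ('No Party' is excluded)."""
    counts = {}
    for month, label in labeled_posts:
        if label in ("Democrat", "Republican"):
            counts.setdefault(month, Counter())[label] += 1
    shares = {}
    for month, c in counts.items():
        total = c["Democrat"] + c["Republican"]
        shares[month] = (c["Democrat"] / total, c["Republican"] / total)
    return shares
```

Plotting those monthly shares per party is what produces the thermometer lines shown above.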

So that it is fresh in your head, here is that aggregation trend line FiveThirtyEight produced for the 2022 election again:

We’ve added a yellow bounding box for the data concerning only 2022, which is the data we used in our Reddit thermometer creation. Let’s compare this to our findings for the same period.

This is quite remarkable to us. While both the Reddit data and FiveThirtyEight’s aggregated polling appear to show Republican popularity increasing right before the election in November, our findings differ significantly from what the polling suggested during the rest of the year. Reddit as a whole fluctuates a fair amount in its positivity toward each party month by month, defying the smooth trends we see in FiveThirtyEight’s graph. Additionally, while an inversion in popularity occurs on both graphs in the early fall, Democratic popularity soars on Reddit in September, which is not captured at all in FiveThirtyEight’s graph. Perhaps some key events can add context to these differences.

Here are our results again, this time with the overlay of three key dates that we suspected may have been large factors in the election:

We suspected at the outset of this project that three events would have generated high amounts of chatter around the election in the lead-up to the midterms: Russia’s invasion of Ukraine, the leak of an upcoming Supreme Court decision that would overturn Roe v. Wade, and the filings in New York suing former President Trump.

While we cannot be certain without a closer analysis of the posts made during these specific timeframes, we suspect that these events did have an effect on the popularity of both parties on Reddit. Democratic supporters are more abundant on Reddit immediately following the Russian invasion of Ukraine, but then fall off relative to Republicans as the months go on, perhaps indicating frustration with Democratic President Joe Biden’s handling of the crisis, or a lack of fervor once the initial outpouring of sentiment after the invasion subsided.

We see what could be a smaller and possibly delayed swing in approval around the leak of the Dobbs decision, although we cannot be certain that the small spike in June’s approval ratio for Democrats is due to the timing of this information the month prior. This was also when the January 6th hearings were being televised over the summer, which galvanized folks of both parties, making it hard to separate out the two events. Given the high volume of identifiably partisan posts in June, we expect that both events created a flurry of activity over the summer months from both sides, with Democrats being slightly more prolific in terms of strongly partisan posts.

It does seem likely to us that the massive peak in September is related to the release of the news surrounding the suits against Trump. While again we cannot be certain without a more in-depth analysis, this seems to be a significant turning point - Democratic-leaning partisan posts were high, second only to the activity peak already discussed in June, whereas Republican partisan posts were quite low, the lowest they’d been since February. We have several theories for this!

This September spike may have occurred because pro-Republican posting took place more on the removed highly partisan subreddits, whereas pro-Democratic posters felt they had more free rein to discuss partisan topics in more diverse subreddits. We could also be seeing this result due to degree of feeling: if, for instance, Democratic posts were more strongly positive whereas posts containing Republican keywords were more subdued, our labelling logic could have missed the quieter pro-Republican posts. A third possibility is that posters who would normally post positively about right-leaning keywords were instead critiquing former President Trump’s behavior and moving across party lines. And finally, a fourth theory admits there could be a flaw in the labelling logic itself: if a post expressed strong disgust over the indictments as being politically motivated, but did not mention keywords other than those on the right, it could have been erroneously sorted into our pro-Democratic category when labelled.

Comparison of the Reddit Thermometer to Unweighted Polls

We’ve been frequently comparing our results to the weighted polling aggregates that FiveThirtyEight created above - but what happens if we take a look at the polling data directly ourselves? Here we can see how our Reddit thermometer compares to the aggregated polling data across the same time period without any weighting from FiveThirtyEight:

Now, this is quite different! The heavy non-dashed lines in the graph above are the straight month-by-month averages of every poll in the FiveThirtyEight dataset. These are not weighted for accuracy, reach, or even sample size; this is just a straight average of folks who preferred Republicans versus folks who preferred Democrats. This is overlaid atop our Reddit thermometer in lighter, dashed lines.
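The unweighted computation really is just a per-month mean over every poll, along the lines of the sketch below; the record fields here are hypothetical, since FiveThirtyEight's published generic-ballot data uses its own column schema:

```python
from collections import defaultdict

def unweighted_monthly_average(polls):
    """polls: iterable of dicts with 'month', 'dem', 'rep' percentages.
    Every poll counts equally: no weighting by sample size,
    recency, or pollster rating."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [dem_total, rep_total, n]
    for p in polls:
        s = sums[p["month"]]
        s[0] += p["dem"]
        s[1] += p["rep"]
        s[2] += 1
    return {m: (d / n, r / n) for m, (d, r, n) in sums.items()}
```

Dropping the weights is what lets small or lower-rated polls pull the average, which is exactly the difference we are examining here.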

Even when simply comparing this to FiveThirtyEight’s graph from above, there are some interesting differences. During this same period, FiveThirtyEight’s weighting shows Republicans ahead more often than not, whereas the simple average shows Democrats leading most of the months right before the election, with the gap narrowing in November. This suggests that part of why the Red Wave was predicted yet failed to manifest is an over-emphasis on polls that skewed Republican. But that is a question for a different group of people to debate.

In terms of how this un-weighted average aligns with our data, we see some similarities but also some differences. The massive spike we see in September seems to be less out of the norm (if still surprising in magnitude) when viewed against the slow gentle rise in Democratic popularity since the start of the summer. This suggests that our earlier findings - that the combination of Dobbs and the January 6th hearings may have pushed more folks to view the Democrats favorably and the Republicans negatively - may have been reflected in several polls, especially when viewed in this unweighted manner.

Future Avenues of Exploration

Now that we have presented our results, we’d like to end with a brief look at possible ways to expand or refine this project.

To start with, we’d like to re-attempt this project with more sophisticated labelling logic. As you can see from the tab discussing the labelling section of the project, we were very tentative in labelling our data. If we could not be strongly certain that a post (either a submission or a comment) could be grouped into one camp or another, it was put into the “No Party” bucket, which then became a catch-all for any posts that were not easily sorted by the binary “this or that” system. Given more resources, we think this could be expanded to include more refined labelling logic with more branches (perhaps the random forest to our current decision-tree-esque process).

We would also suggest a re-run of this process with the additional step of aggregating posts by user before applying a label. This would have a dual effect of reducing the number of rows to work with (although perhaps also widening the data quite a bit), and of potentially “tying” a political but non-preferential post to a Democratic- or Republican-leaning author. This would also make the direct comparison to polls a bit more justifiable to a layperson: after all, people are the ones who cast ballots, not posts.
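The user-level step we have in mind could be sketched as follows: collect each author's labelled posts and assign the author a majority label, so that one "voter" is counted per person rather than per post. The field names and the simple majority rule are our own assumptions for this sketch:

```python
from collections import Counter, defaultdict

def label_users(posts):
    """posts: iterable of (author, label) pairs, where label is
    'Democrat', 'Republican', or 'No Party'.
    Returns {author: majority partisan label, or 'No Party' on a tie
    or when no partisan posts exist for that author}."""
    by_author = defaultdict(Counter)
    for author, label in posts:
        by_author[author][label] += 1
    verdicts = {}
    for author, c in by_author.items():
        dem, rep = c["Democrat"], c["Republican"]
        if dem > rep:
            verdicts[author] = "Democrat"
        elif rep > dem:
            verdicts[author] = "Republican"
        else:
            verdicts[author] = "No Party"
    return verdicts
```

An author's "No Party" posts could then inherit the author's verdict, which is the tying effect described above.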

While we have great confidence in our keywords, and in the choice to initially lemmatize many of them in order to cast the widest possible net at the outset of our project, in another iteration of this same project we would have sorted more of the keywords into left- or right-leaning camps, and perhaps into more than two simple categories. You can see an early hint of this in our labelling logic, where a select few generic keywords are pulled out to be considered in a partisan manner, but this process could use an expansion given more time and expertise. Perhaps a weighting system prioritizing some keywords over others would be superior to our binary “is the word present” analysis.
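A weighted variant of our binary "is the word present" check might look like the sketch below; the keywords, weights, and margin are invented for illustration, not values from our pipeline:

```python
# Hypothetical weights: strongly partisan keywords count more than
# generic ones; the real keyword list was far larger and lemmatized.
KEYWORD_WEIGHTS = {
    "maga": ("Republican", 3.0),
    "bluewave": ("Democrat", 3.0),
    "trump": ("Republican", 1.0),
    "biden": ("Democrat", 1.0),
}

def weighted_lean(text, margin=1.5):
    """Sum keyword weights per party; declare a lean only when one
    side outweighs the other by at least `margin`."""
    scores = {"Democrat": 0.0, "Republican": 0.0}
    for word in text.lower().split():
        if word in KEYWORD_WEIGHTS:
            party, w = KEYWORD_WEIGHTS[word]
            scores[party] += w
    diff = scores["Democrat"] - scores["Republican"]
    if diff >= margin:
        return "Democrat"
    if diff <= -margin:
        return "Republican"
    return "No Party"
```

The margin plays the same cautious role as our current "No Party" bucket: mixed or weakly signalled posts stay unlabelled.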

And lastly, given more resources to run more queries, we would have liked to add a geographical component to further segment the data by the political race being discussed. This could have allowed for some fine-tuning in the poll comparison, letting us look at multiple races at once instead of the overall trendline, perhaps to see if there were races that skewed the result or drew more attention than others. This would likely involve the re-inclusion of our previously excluded subreddits, like r/Democrats and r/Republicans, with some sort of weighting system again, this time to limit the skew from the heavily partisan nature of these subreddits.

In Conclusion

Overall, we are quite pleased with this project and with our findings. We are content with the performance of the sentiment analysis model from John Snow Labs. We have confidence in our labelling logic as successfully identifying whether a post is strongly partisan in favor of one party or another. We are happy to see that our model for labelling future Reddit posts based on our pipeline is fairly accurate and much more easily deployable than the many jobs we ran to get from A to B. And we are delighted that, after all that, we were able to produce a metric that could be successfully compared to real-life polling to better understand the results of the 2022 election.

In general, it seems that Reddit was more accurate than the aggregated polling in the year leading up to the election - or, if not more accurate, at least less wrong. The back-and-forth observed each month between the numbers of posts speaking positively about one party or the other indicates to us a high amount of dialogue concerning the ideologies of both parties, with no clear winner emerging on this Reddit stage across multiple months. Reddit, at least, did not seem poised to flood the elections with a Red Wave. That could be a significant part of why the Red Wave did not materialize as predicted: the people having the conversations with one another, convincing one another, were not being reflected in the polls. Or at least, not in the weighted averages of the polls.

In conclusion: you might be able to trust the polls, but you probably can’t trust Nate Silver.